In objective Bayesian model selection, no single criterion has 
emerged as dominant in defining objective prior distributions. In- 
deed, many criteria have been separately proposed and utilized to 
propose differing prior choices. We first formalize the most general 
and compelling of the various criteria that have been suggested, to- 
gether with a new criterion. We then illustrate the potential of these 
criteria in determining objective model selection priors by consider- 
ing their application to the problem of variable selection in normal 
linear models. This results in a new model selection objective prior 
with a number of compelling properties. 

1. Introduction. 

1.1. Background. A key feature of Bayesian model selection, when the 
models have differing dimensions and noncommon parameters, is that results 
are typically highly sensitive to the choice of priors for the noncommon 
parameters, and, unlike the scenario for estimation, this sensitivity does 
not vanish as the sample size grows; see Kass and Raftery (1995), Berger 
and Pericchi (2001). Furthermore, improper priors cannot typically be used 
for noncommon parameters, nor can "vague proper priors" (see the above 
references, e.g., and the brief discussion in Section 2.2), ruling out use of the 
main tools developed in objective Bayesian estimation theory. 

Because of the difficulty in assessing subjective priors for numerous mod- 
els, there have been many efforts (over more than 30 years) to develop "con- 
ventional" or "objective" priors for model selection; we will term these "ob- 
jective model selection priors," the word objective simply meant to indicate 
that they are not subjective priors, and are chosen conventionally based on 
the models being considered. A few of the many references most related 
to this paper are Jeffreys (1961), Zellner and Siow (1980, 1984), Laud and 
Ibrahim (1995), Kass and Wasserman (1995), Berger and Pericchi (1996), 
Moreno, Bertolino and Racugno (1998), De Santis and Spezzaferri (1999), 
Perez and Berger (2002), Bayarri and Garcia-Donato (2008), Liang et al. 
(2008), Cui and George (2008), Maruyama and George (2008), Maruyama 
and Strawderman (2010). 

For the most part, these efforts were started with a good idea which was 
used to develop the priors, and then the behavior of the priors was studied. 
Yet, in spite of the apparent success of many of these methods, there has 
been no agreement as to which are most appealing or most successful. 

This lack of progress in reaching a consensus on objective priors for model 
selection resulted in our approaching the problem from a different direction, 
namely, formally formulating the various criteria that have been deemed 
essential for model selection priors (such as consistency of the resulting pro- 
cedure), and seeing if these criteria can essentially determine the priors. 

The criteria are stated for general model selection problems in Section 2, 
which also discusses their historical antecedents. To illustrate that applica- 
tion of the criteria can largely determine model selection priors, we turn to 
a specific problem in Section 3 — variable selection in normal linear models. 
The resulting priors for variable selection are new and result in closed form 
Bayes factors; for those primarily interested in the methodology itself, the 
resulting priors and Bayes factors are given in Section 4. 

1.2. Notation. Let y be a data vector of size n from one of the models 

(1) $M_0: f_0(y \mid \alpha), \qquad M_i: f_i(y \mid \alpha, \beta_i), \quad i = 1, 2, \ldots, N - 1,$

where $\alpha$ and the $\beta_i$ are unknown model parameters, the latter having
dimension $k_i$. $M_0$ will be called the null model and is nested in all of the
considered models.

Under the null model, the prior is $\pi_0(\alpha)$; under model $M_i$, and without
loss of generality, we express the model selection prior as

$\pi_i(\alpha, \beta_i) = \pi_i(\alpha)\, \pi_i(\beta_i \mid \alpha).$

Note that the parameter $\alpha$ occurs in all of the models, so that $\alpha$ is
usually referred to as the common parameter; the $\beta_i$ are called model specific
parameters.

Assuming that one of the entertained models is true, the posterior probability
of each of the models $M_i$ can be written in the convenient form

(2) $\Pr(M_i \mid y) = \dfrac{P_{i0} B_{i0}}{1 + \sum_{j=1}^{N-1} P_{j0} B_{j0}},$

where $P_{j0}$ is the prior odds $P_{j0} = \Pr(M_j)/\Pr(M_0)$, with $\Pr(M_j)$ being the
prior probability of model $M_j$, and $B_{j0}$ is the Bayes factor of model $M_j$
to $M_0$ defined by

(3) $B_{j0} = \dfrac{m_j(y)}{m_0(y)}$ with $m_j(y) = \int f_j(y \mid \alpha, \beta_j)\, \pi_j(\alpha, \beta_j)\, d\alpha\, d\beta_j$

and $m_0(y) = \int f_0(y \mid \alpha)\, \pi_0(\alpha)\, d\alpha$ being the marginal likelihoods of model $M_j$
and $M_0$ corresponding to the model prior densities $\pi_j(\alpha, \beta_j)$ and $\pi_0(\alpha)$.
[Any model could serve as the base model for computation of the Bayes
factors in (2), but use of the null model is common and convenient.] The
focus in this paper is on choice of model priors $\pi_0(\alpha)$ and $\pi_j(\alpha, \beta_j)$.
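
As a minimal sketch of how the posterior model probabilities follow from the Bayes factors and prior odds defined above, the function below normalizes the weights $P_{j0} B_{j0}$, with weight 1 for the null model. The function name and the default of equal prior odds are assumptions made for illustration only.

```python
def posterior_model_probabilities(bayes_factors, prior_odds=None):
    """Posterior probabilities [Pr(M_0|y), Pr(M_1|y), ...].

    bayes_factors[j-1] holds B_{j0} and prior_odds[j-1] holds P_{j0}
    for j = 1, ..., N-1. The null model contributes the weight 1, so
    Pr(M_i|y) = P_{i0} B_{i0} / (1 + sum_j P_{j0} B_{j0}).
    """
    if prior_odds is None:
        # Assumption for illustration: equal prior model probabilities.
        prior_odds = [1.0] * len(bayes_factors)
    if len(prior_odds) != len(bayes_factors):
        raise ValueError("need one prior odds value per Bayes factor")
    weights = [1.0] + [p * b for p, b in zip(prior_odds, bayes_factors)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Two alternatives with B_10 = 3 and B_20 = 1 under equal prior odds.
    print(posterior_model_probabilities([3.0, 1.0]))
```

Only the Bayes factors depend on the data, which is why the choice of the priors entering the marginal likelihoods is the focus of what follows.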

2. Criteria for objective model selection priors. 

2.1. Introduction. The arguments concerning prior choice in testing and 
model selection in Jeffreys (1961) are often called Jeffreys's desiderata [see 
Berger and Pericchi (2001)] and are the precursors to the criteria developed 
herein. [Robert, Chopin and Rousseau (2009), is a comprehensive and mod- 
ern review of Jeffreys's book.] These and related ideas have been repeatedly 
used to evaluate or guide development of objective model priors; see, for ex- 
ample, Berger and Pericchi (2001), Bayarri and Garcia-Donato (2008), Liang 
et al. (2008) and Forte (2011). We group the criteria into four classes: basic, 
consistency criteria, predictive matching criteria and invariance criteria. 

2.2. Basic criteria. As mentioned in the Introduction, priors for the
noncommon parameters $\beta_i$ should be proper, because they only occur in the
numerator of the Bayes factors $B_{i0}$, and hence, if using an improper prior,
the arbitrary constant for the improper prior would not cancel, making $B_{i0}$
ill defined. There have been various efforts to use improper priors and define
a meaningful scaling [Ghosh and Samanta (2002), Spiegelhalter and Smith
(1982)]; and other methods have been proposed that can be interpreted as
implicitly scaling the improper prior Bayes factor [see details and references
in Bayarri and García-Donato (2008)], but we are restricting consideration
here to real Bayesian procedures.

Similarly, vague proper priors cannot be used in determining the $B_{i0}$,
since the arbitrary scale of vagueness appears as a multiplicative term in
the Bayes factor, again rendering the Bayes factor arbitrary. Thus we have:

Criterion 1 (Basic). Each conditional prior $\pi_i(\beta_i \mid \alpha)$ must be proper
(integrating to one) and cannot be arbitrarily vague in the sense of almost
all of its mass being outside any believable compact set.

2.3. Consistency criteria. Following Liang et al. (2008), we consider two 
primary consistency criteria — model selection consistency and information 
consistency: 
Criterion 2 (Model selection consistency). If data $y$ have been generated
by $M_i$, then the posterior probability of $M_i$ should converge to 1 as the
sample size $n \to \infty$.

Model selection consistency is not particularly controversial, although it 
can be argued that the true model is never one of the entertained models, 
so that the criterion is vacuous. Still, it would be philosophically troubling 
to be in a situation with infinite data generated from one of the models 
being considered, and not choosing the correct model. A number of recent 
references concerning this criterion are Fernandez, Ley and Steel (2001), 
Berger, Ghosh and Mukhopadhyay (2003), Liang et al. (2008), Casella et al. 
(2009), Guo and Speckman (2009). 

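Criterion 3 (Information consistency). For any model $M_i$, if $\{y_m, m =
1, \ldots\}$ is a sequence of data vectors of fixed size such that, as $m \to \infty$,

(4) $\Lambda_{i0}(y_m) = \dfrac{\sup_{\alpha, \beta_i} f_i(y_m \mid \alpha, \beta_i)}{\sup_{\alpha} f_0(y_m \mid \alpha)} \to \infty,$

then $B_{i0}(y_m) \to \infty$.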

In normal linear models, this is equivalent to saying that, if one considers 
a sequence of data vectors for which the corresponding F (or t) statistic goes 
to infinity, then the Bayes factor should also do so for this sequence. Jeffreys 
(1961) used this argument to justify a Cauchy prior in testing that a normal 
mean is zero, and the argument has also been highlighted in Berger and 
Pericchi (2001), Bayarri and García-Donato (2008), Liang et al. (2008). One 
can construct examples in which a real Bayesian answer violates information 
consistency, but the examples are based on very small sample sizes and priors 
with extremely flat tails. Furthermore, violation of information consistency 
would place frequentists and Bayesians in a particularly troubling conflict, 
which many would view as unattractive. 

A third type of consistency has been proposed to address the fact that 
objective model selection priors typically depend on specific features of the 
model, such as the sample size or the particular covariates being considered. 

Criterion 4 (Intrinsic prior consistency). Let $\pi_i(\beta_i \mid \alpha, n)$ denote the
prior for the model specific parameters of model $M_i$ with sample size $n$. Then,
as $n \to \infty$ and under suitable conditions on the evolution of the model with $n$,
$\pi_i(\beta_i \mid \alpha, n)$ should converge to a proper prior $\pi_i(\beta_i \mid \alpha)$.

The idea here is that, while features of the model and sample size (and 
possibly even data) frequently affect model selection priors, such features 
should disappear for large n. If there is such a limiting prior, it is called an 
intrinsic prior; see Berger and Pericchi (2001) for extensive discussion and 
previous references. (Note that some have used the phrase "intrinsic prior" 
to refer to specific priors arising from a specific model selection method, but 
we use the term here generically.) 
2.4. Predictive matching criteria. The most crucial aspect of objective 
model selection priors is that they be appropriately "matched" across models 
of different dimensions. Having a prior scale factor "wrong" by a factor of 2 
does not matter much in one dimension, but in 50 dimensions that becomes 
an error of $2^{50}$ in the Bayes factor. There have been many efforts to achieve 
such matching in model selection, including Spiegelhalter and Smith (1982), 
Suzuki (1983), Laud and Ibrahim (1995), Ghosh and Samanta (2002). 
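
The dimension effect described above can be illustrated with a toy calculation. The sketch below assumes a deliberately simple setting that is not the variable selection problem of this paper: $y \sim N_k(\beta, I)$ with known variance, the null model $\beta = 0$, and a conjugate prior $\beta \sim N_k(0, \tau^2 I)$ under the alternative, for which the Bayes factor has an elementary closed form. The function names and the values of $\tau$ and $y$ are hypothetical.

```python
import math


def log_bayes_factor(y, tau):
    """log B_10 for M_1: y ~ N_k(beta, I), beta ~ N_k(0, tau^2 I) vs M_0: beta = 0.

    Under M_1 the marginal of y is N_k(0, (1 + tau^2) I), which gives
    B_10 = (1 + tau^2)^(-k/2) * exp(0.5 * |y|^2 * tau^2 / (1 + tau^2)).
    """
    k = len(y)
    sum_sq = sum(v * v for v in y)
    shrink = tau * tau / (1.0 + tau * tau)
    return -0.5 * k * math.log(1.0 + tau * tau) + 0.5 * shrink * sum_sq


def log2_change_when_scale_doubles(y, tau):
    """Base-2 log of B_10(2 * tau) / B_10(tau)."""
    return (log_bayes_factor(y, 2.0 * tau) - log_bayes_factor(y, tau)) / math.log(2.0)


if __name__ == "__main__":
    tau = 100.0  # a fairly diffuse prior scale (illustrative value)
    for k in (1, 50):
        y = [1.0] * k
        print(k, log2_change_when_scale_doubles(y, tau))
```

For a diffuse prior, doubling the scale changes the Bayes factor by roughly a factor of $2^{-k}$: a factor of about 2 when $k = 1$, but about $2^{50}$ when $k = 50$.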

The standard approach to predictive matching is modeled after Jeffreys 
(1961). For example, Jeffreys defined a "minimal sample size" for which 
one would logically be unable to discriminate between two hypotheses, and 
argued that the prior distributions should be chosen to then yield equal 
marginal likelihoods for the two hypotheses. Here is an illustration of this 
type of argument, from Berger, Pericchi and Varshavsky (1998). 

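Example. Suppose one is comparing two location-scale models

$M_1: y \sim \dfrac{1}{\sigma}\, p_1\!\left(\dfrac{y - \mu}{\sigma}\right)$ and $M_2: y \sim \dfrac{1}{\sigma}\, p_2\!\left(\dfrac{y - \mu}{\sigma}\right).$

Intuitively, two independent observations $(y_1, y_2)$ should not allow for
discrimination between these models, since two observations only allow setting
of the center and scale of the distribution; there are no "degrees of freedom"
left for model discrimination. Now consider the choice of prior (for
both models) $\pi(\mu, \sigma) = 1/\sigma$. It is shown in Berger, Pericchi and Varshavsky
(1998) that

$\int \dfrac{1}{\sigma^2}\, p_1\!\left(\dfrac{y_1 - \mu}{\sigma}\right) p_1\!\left(\dfrac{y_2 - \mu}{\sigma}\right) \pi(\mu, \sigma)\, d\mu\, d\sigma = \int \dfrac{1}{\sigma^2}\, p_2\!\left(\dfrac{y_1 - \mu}{\sigma}\right) p_2\!\left(\dfrac{y_2 - \mu}{\sigma}\right) \pi(\mu, \sigma)\, d\mu\, d\sigma = \dfrac{1}{2|y_1 - y_2|}$

for any pair of observations $y_1 \ne y_2$, so that the models would be said to be
predictively matched for all minimal samples. The Bayes factor between the
models is then obviously 1, agreeing with the earlier intuition that a minimal
sample should not allow for model discrimination.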
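
The value of the two-observation marginal under the prior $1/\sigma$ can be checked by a change of variables. The following is a sketch under the assumption that $p$ is any continuous density on the real line; it is not reproduced from the cited reference.

```latex
% Marginal of (y_1, y_2) for a location-scale density p under \pi(\mu,\sigma) = 1/\sigma.
\begin{align*}
m(y_1, y_2)
  &= \int_0^\infty \!\! \int_{-\infty}^{\infty}
     \frac{1}{\sigma^{3}}\,
     p\!\left(\frac{y_1 - \mu}{\sigma}\right)
     p\!\left(\frac{y_2 - \mu}{\sigma}\right) d\mu \, d\sigma . \\
\intertext{Set $u = (y_1 - \mu)/\sigma$ and $v = (y_2 - \mu)/\sigma$, so that
$\sigma = (y_2 - y_1)/(v - u)$ and
$\left|\partial(u, v)/\partial(\mu, \sigma)\right| = |v - u|/\sigma^{2}$. Then}
m(y_1, y_2)
  &= \iint_{\operatorname{sign}(v - u) = \operatorname{sign}(y_2 - y_1)}
     \frac{p(u)\, p(v)}{\sigma\, |v - u|}\, du\, dv
   = \frac{1}{|y_1 - y_2|}
     \iint_{\operatorname{sign}(v - u) = \operatorname{sign}(y_2 - y_1)}
     p(u)\, p(v)\, du\, dv \\
  &= \frac{1}{2\,|y_1 - y_2|},
\end{align*}
% since \sigma |v - u| = |y_1 - y_2| and, for i.i.d. continuous U, V with density p,
% each ordering of (U, V) has probability 1/2.
```

The result does not depend on $p$, which is why any two location-scale models are exactly matched at this sample size.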

This argument was formalized by Berger and Pericchi (2001) as follows. 

Definition 1. The model/prior pairs $\{M_i, \pi_i\}$ and $\{M_j, \pi_j\}$ are
predictive matching at sample size $n^*$ if the predictive distributions $m_i(y^*)$
and $m_j(y^*)$ are close in terms of some distance measure for data of that
sample size. The model/prior pairs $\{M_i, \pi_i\}$ and $\{M_j, \pi_j\}$ are exact
predictive matching at sample size $n^*$ if $m_i(y^*) = m_j(y^*)$ for all $y^*$ of sample
size $n^*$.

One only wants predictive matching for "minimal" sample sizes, since, for 
larger sample sizes, the discrimination between models occurs through the 
marginal densities; they must differ for discrimination. 

Criterion 5 (Predictive matching). For appropriately defined "mini- 
mal sample size" in comparing Mi with Mj, one should have model selection 
priors that are predictive matching. Optimal (though not always obtainable) 
is exact predictive matching. 

In Berger and Pericchi (2001), minimal sample size was defined as the 
smallest sample size for which the models under consideration have finite 
marginal densities when objective estimation priors are used. Typically this 
minimal sample size equals the number of parameters in the model or, more 
generally, is the number of observations needed for all parameters to be 
identifiable. For model selection, however, minimal sample size needs to be 
defined relative to the model selection priors being utilized. Hence we have 
the following general definition. 

Definition 2 (Minimal training sample). A minimal training sample $y^*$
for $\{M_i, \pi_i\}$ is a sample of minimal size $n^* \ge 1$ with a finite nonzero marginal
density $m_i(y^*)$.

There are many possibilities for even exact predictive matching. We here 
highlight two types of exact predictive matching, which are of particular 
relevance to the development of objective model selection priors for the 
variable selection problem discussed in Section 3. 

Definition 3 (Null predictive matching). The model selection priors
are null predictive matching if each of the model/prior pairs $\{M_i, \pi_i\}$ and
$\{M_0, \pi_0\}$ are exact predictive matching for all minimal training samples $y_i^*$
for $\{M_i, \pi_i\}$.

Definition 3 reflects the common view — starting with Jeffreys (1961) — 
that data of a minimal size should not allow one to distinguish between 
the null and alternative models. Null predictive matching arguments have 
also been used by Ghosh and Samanta (2002) and Spiegelhalter and Smith 
(1982) among others. 

Definition 4 (Dimensional predictive matching). The model selection
priors are dimensional predictive matching if each of the model/prior pairs
$\{M_i, \pi_i\}$ and $\{M_j, \pi_j\}$ of the same complexity/dimension (i.e., $k_i = k_j$) are
exact predictive matching for all minimal training samples $y^*$ for models of
that dimension.

The next section gives the most prominent example of dimensional pre- 
dictive matching. 
2.5. Invariance criteria. Invariance arguments have played a prominent 
role in statistics [cf. Berger (1985)], especially in objective Bayesian estima- 
tion theory. They are also extremely helpful in part of the specification of 
objective Bayesian model selection priors. 

A basic type of invariance that is almost always relevant for model selec- 
tion is invariance to the units of measurement being used: 

Criterion 6 (Measurement invariance). The units of measurement used 
for the observations or model parameters should not affect Bayesian answers. 

A much more powerful, but special, type of invariance arises when the 
family of models under consideration are such that the model structures are 
invariant to group transformations. Following the notation in Berger (1985), 
we formally state: 

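Definition 5. The family of densities for $y \in \mathbb{R}^n$, $\mathcal{F} := \{f(y \mid \theta) : \theta \in \Theta\}$,
is said to be invariant under the group of transformations $\mathcal{G} := \{g : \mathbb{R}^n \to \mathbb{R}^n\}$
if, for every $g \in \mathcal{G}$ and $\theta \in \Theta$, there exists a unique $\theta^* \in \Theta$ such that $X =
g(y)$ has density $f(x \mid \theta^*) \in \mathcal{F}$. In such a situation, $\theta^*$ will be denoted $\bar{g}(\theta)$.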

There are two consequences of applying invariance here. The first is a new 
criterion: 

Criterion 7 (Group invariance). If all models are invariant under a group
of transformations $G_0$, then the conditional distributions, $\pi_i(\beta_i \mid \alpha)$, should
be chosen in such a way that the conditional marginal distributions

(5) $f_i(y \mid \alpha) = \int f_i(y \mid \alpha, \beta_i)\, \pi_i(\beta_i \mid \alpha)\, d\beta_i$

are also invariant under $G_0$. [Here, $(\alpha, \beta_i)$ would correspond to $\theta$ in the
definition of invariance.]

Indeed, the $\pi_i(\beta_i \mid \alpha)$ could hardly be called objective model selection
priors if they eliminated an invariance structure that was possessed by all 
of the original models. This can also be viewed as a formalization of the 
Jeffreys (1961) requirement that the prior for a nonnull parameter should 
be "centered at the simple model." 

The second use of invariance is in determining the objective prior for the
common model parameters $\pi_i(\alpha)$. Since all of the marginal models, $f_i(y \mid \alpha)$,
will be invariant under $G_0$ if the Group invariance criterion is applied, there
are compelling reasons to choose the prior

(6) $\pi_i(\alpha) = \pi^H(\alpha)$ for all $i$,

where $\pi^H(\cdot)$ is the right-Haar density corresponding to the group $G_0$. The
reason is given in Berger, Pericchi and Varshavsky (1998), namely that under
commonly satisfied conditions (satisfied for the variable selection problem;
see Result 2 in Section 3), use of a common $\pi^H(\alpha)$ for all marginal models
then ensures exact predictive matching among the models for the minimal
training sample size, as in the example given in Section 2.4.

The most surprising feature of this result is that $\pi^H(\alpha)$ is typically
improper (and hence could be multiplied by an arbitrary constant) and yet, if
the same $\pi^H(\alpha)$ is used for all marginal models, the prior is appropriately
calibrated across models in the strong sense of exact predictive matching.
(For any improper prior that occurred in both the numerator and denomina- 
tor of a Bayes factor, any arbitrary multiplicative constant would obviously 
cancel, but this is not nearly as compelling a justification as exact predic- 
tive matching.) The right-Haar prior is also the objective estimation prior 
for such models, and so has been extensively studied in invariant situations. 

Thus, for invariant models, the combination of the Group invariance
criterion and (exact) Predictive matching criterion allows complete specification
of the prior for $\alpha$ in all models. It is also surprising that this argument does
not require orthogonality of $\alpha$ and $\beta_i$ (i.e., cross-information of zero in the
Fisher information matrix) which, since Jeffreys (1961), has been viewed
as a necessary condition to say that one can use a common prior for $\alpha$ in
different models [see, e.g., Hsiao (1997), Kass and Vaidyanathan (1992)].

There might be concern here as to use of improper priors, even if they are 
exact predictive matching, especially because of the discussion in Section 2.2. 
This concern is obviated by the realization that use of any series of proper
priors approximating $\pi^H(\alpha)$ will, in the limit, yield Bayes factors equal to
that obtained directly from $\pi^H(\alpha)$; see Lemma 1 in Appendix A.1.

3. Objective prior distributions for variable selection in normal linear 
models. 

3.1. Introduction. We now turn to a particular scenario, variable selection
in normal linear models, to illustrate application of the criteria in
Section 2. Consider a response variable $Y$ known to be explained by $k_0$
variables (e.g., an intercept) and by some subset of $p$ other possible explanatory
variables. This can formally be stated as a model selection problem with the
following $2^p$ competing models for data $y = (y_1, \ldots, y_n)$:

$M_0: f_0(y \mid \beta_0, \sigma) = N_n(y \mid X_0\beta_0, \sigma^2 I),$

(7) $M_i: f_i(y \mid \beta_i, \beta_0, \sigma) = N_n(y \mid X_0\beta_0 + X_i\beta_i, \sigma^2 I), \quad i = 1, \ldots, 2^p - 1,$

where $\beta_0$, $\sigma$, and the $\beta_i$ are unknown. Here $X_0$ is an $n \times k_0$ design matrix
corresponding to the $k_0$ variables common to all models; often $X_0 = 1$ so $M_0$
contains only the intercept. Finally, the $X_i$ are $n \times k_i$ design matrices
corresponding to $k_i$ of the $p$ other possible explanatory variables. We make
the usual assumption that all design matrices are full rank (without loss of
generality). Note that, if the covariance matrix is of the form $\sigma^2 A$ with $A$
known, simply transform $Y$ so that the covariance matrix is proportional to
the identity; note that this does not alter the meaning of the $\beta$'s and hence
the meaning of the models. Also, setting $\alpha = (\beta_0, \sigma)$ and $N = 2^p$ puts this
in the general framework discussed earlier, with $M_0$ being the null model.
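
To make the size of the model space concrete, the sketch below enumerates the candidate models as subsets of the $p$ optional covariates, with the empty subset standing for the null model. The function name and the representation of a model as a tuple of column indices are assumptions made for illustration.

```python
from itertools import combinations


def enumerate_models(p):
    """Yield every subset of the p candidate covariates as a tuple of indices.

    The empty tuple is the null model M_0, which contains only the
    k_0 variables common to all models.
    """
    for size in range(p + 1):
        for subset in combinations(range(p), size):
            yield subset


if __name__ == "__main__":
    models = list(enumerate_models(3))
    print(len(models))  # number of competing models for p = 3
    print(models)
```

The count doubles with each additional covariate, which is one reason closed form Bayes factors are attractive in this problem.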

The primary development is for the most common situation of $\sigma$ unknown
and $k_0 \ge 1$, but the simpler cases where either $\sigma$ is known or $k_0 = 0$ (i.e., the
null model only contains the error term) are briefly treated in Section 3.5.

In this setting and following Jeffreys's desiderata, Zellner and Siow (1980)
recommended use of common objective estimation priors for $\alpha$ (after
orthogonalization) and multivariate Cauchy priors for $\pi_i(\beta_i \mid \alpha)$, centered at
zero and with prior scale matrix $\sigma^2 n (X_i^t X_i)^{-1}$; a similar scale matrix was
used in Zellner (1986) for the g-prior.

3.2. Proposed prior (the "robust prior"). It is useful to first write down 
the specific form of the prior that will result from applying the criteria. 
Indeed, under model $M_i$, the prior is of the form

(8) $\pi_i^R(\beta_0, \beta_i, \sigma) = \pi(\beta_0, \sigma) \times \pi_i^R(\beta_i \mid \beta_0, \sigma) = \sigma^{-1} \times \int_0^\infty N_{k_i}(\beta_i \mid 0, g\, \Sigma_i)\, p_i^R(g)\, dg,$

where $\Sigma_i = \mathrm{Cov}(\hat\beta_i) = \sigma^2 (V_i^t V_i)^{-1}$ is the covariance of the maximum
likelihood estimator of $\beta_i$, with

(9) $V_i = (I_n - X_0 (X_0^t X_0)^{-1} X_0^t) X_i$

and

(10) $p_i^R(g) = a [\rho_i (b + n)]^a (g + b)^{-(a+1)}\, 1_{\{g > \rho_i (b + n) - b\}},$

(11) with $a > 0$, $b > 0$ and $\rho_i \ge \dfrac{b}{b + n}$.

Note that these conditions ensure that $p_i^R(g)$ is a proper density, and $g$ is
positive [necessary in (8)], so that $\pi_i^R(\beta_i \mid \beta_0, \sigma)$ is proper, satisfying the
first part of the Basic criterion of Section 2.2. The particular choices of
hyperparameters that we favor are discussed in Section 3.4.
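
The mixing density in (10) is a shifted Pareto-type density, so its distribution function and quantile function are available in closed form. The sketch below, with hypothetical function names, evaluates them and enforces the conditions in (11); the distribution and quantile functions are elementary consequences of (10) rather than formulas stated in the text.

```python
def _check(a, b, rho, n):
    """Enforce the conditions in (11): a > 0, b > 0, rho >= b / (b + n)."""
    if not (a > 0 and b > 0 and rho >= b / (b + n)):
        raise ValueError("need a > 0, b > 0 and rho >= b / (b + n)")


def robust_lower_bound(rho, b, n):
    """Left end point of the support of p^R(g), namely rho * (b + n) - b."""
    return rho * (b + n) - b


def robust_mixing_pdf(g, a, b, rho, n):
    """Density (10): a * [rho (b + n)]^a * (g + b)^(-(a + 1)) on g > lower bound."""
    _check(a, b, rho, n)
    if g <= robust_lower_bound(rho, b, n):
        return 0.0
    return a * (rho * (b + n)) ** a * (g + b) ** (-(a + 1.0))


def robust_mixing_cdf(g, a, b, rho, n):
    """Integral of (10) up to g: 1 - [rho (b + n) / (g + b)]^a."""
    _check(a, b, rho, n)
    if g <= robust_lower_bound(rho, b, n):
        return 0.0
    return 1.0 - (rho * (b + n) / (g + b)) ** a


def robust_mixing_quantile(u, a, b, rho, n):
    """Inverse of the distribution function for 0 <= u < 1."""
    _check(a, b, rho, n)
    if not 0.0 <= u < 1.0:
        raise ValueError("u must lie in [0, 1)")
    return rho * (b + n) * (1.0 - u) ** (-1.0 / a) - b


if __name__ == "__main__":
    # Illustrative values: a = 1/2, b = 1, n = 20, rho = 1/3.
    a, b, n, rho = 0.5, 1.0, 20, 1.0 / 3.0
    g = robust_mixing_quantile(0.3, a, b, rho, n)
    print(g, robust_mixing_cdf(g, a, b, rho, n))
```

Because the distribution function tends to 1 as $g \to \infty$, the density integrates to one whenever (11) holds, and the lower end of the support is nonnegative exactly when $\rho_i \ge b/(b + n)$.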

The prior (8) has its origins in the robust prior introduced by Strawderman
(1971) and Berger (1980, 1985), for estimating a $k$-variate normal
mean $\beta$ in the sampling scheme $\hat\beta \sim N_k(\beta, \Sigma)$. More precisely, the full
conditional of $\beta_i$ induced by (8) generalizes the above mentioned robust prior
considering the sampling distribution of the maximum likelihood estimator,
namely $\hat\beta_i \sim N_{k_i}(\beta_i, \sigma^2 (V_i^t V_i)^{-1})$. The primary reasons for Strawderman
(1971) and Berger (1980, 1985) to consider such priors were that they
result in closed form inferences, including closed form Bayes factors, and
result in estimates that are robust in various senses. For this reason, we
continue the tradition of calling (8) the robust prior and use a superindex R
to denote it. Note also that priors of this form have been previously considered.
The priors proposed by Liang et al. (2008) are particular cases with
$a = 1/2$, $b = 1$, $\rho_i = 1/(1 + n)$ (the hyper-g prior) and $a = 1/2$, $b = n$, $\rho_i = 1/2$
(the hyper-g/n prior). The prior in Cui and George (2008) has $a = 1$, $b = 1$,
$\rho_i = 1/(1 + n)$. The original Berger prior for robust estimation is the particular
case with $a = 1/2$, $b = 1$, $\rho_i = (k_i + 1)/(k_i + 3)$; closely related priors
are those of Maruyama and Strawderman (2010), Maruyama and George
(2008).

Finally, it is useful to note that $\pi_i^R(\beta_i \mid \beta_0, \sigma)$ behaves in the tails as
a multivariate Student distribution (already noticed for a particular case in
Berger (1980), and the reason for its robust estimation properties).

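Proposition 1. Writing $\|\beta_i\|^2 = \beta_i^t (V_i^t V_i) \beta_i$,

$\lim_{\|\beta_i\|^2 \to \infty} \dfrac{\pi_i^R(\beta_i \mid \beta_0, \sigma)}{\mathrm{St}_{k_i}\!\left(\beta_i \mid 0, (a\Gamma(a))^{1/a} \rho_i B^*(b, \sigma)/a, 2a\right)} = 1,$

where $B^*(b, \sigma) = \sigma^2 (b + n)(V_i^t V_i)^{-1}$.
Proof. See Appendix A.2. □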

In the model selection scenario, the thickness of the prior tails is related to
the information consistency criterion, and is the reason Jeffreys (1961) used
a Cauchy as the prior for testing a normal mean. Also, using this result,
we can see that $\pi_i^R(\beta_i \mid \beta_0, \sigma)$ has close connections with the Zellner-Siow
priors; in fact, for $a = 1/2$, $b = n$, $\rho_i = 2/\pi$ and large $n$, $\pi_i^R(\beta_i \mid \beta_0, \sigma)$ and
the Zellner-Siow priors have exactly the same tails.

3.3. Justification of model selection priors of the form (8). We will use
the Group invariance criterion and Predictive matching criterion (along with
practical computational considerations) to justify use of model selection
priors of the form (8). We first justify the use of $\pi^R(\beta_0, \sigma) = 1/\sigma$ for the
common parameters and then justify the choice $\pi_i^R(\beta_i \mid \beta_0, \sigma)$ for the model
specific parameters.

3.3.1. Justification of the prior for the common parameters. It is convenient,
in this section, to consider a more general class of conditional priors,

(12) $\pi_i(\beta_i \mid \beta_0, \sigma) = \sigma^{-k_i} h_i\!\left(\dfrac{\beta_i}{\sigma}\right),$

where $h_i$ is any proper density with support $\mathbb{R}^{k_i}$. The robust prior is the
particular case

(13) $h_i^R(u) = \int N_{k_i}(u \mid 0, g (V_i^t V_i)^{-1})\, p_i^R(g)\, dg.$
It is shown, in Appendix A.3, that all models in (7) are invariant under
the group of transformations

$G_0 = \{g = (c, b) \in (0, \infty) \times \mathbb{R}^{k_0} : g(y) = c\,y + X_0 b\}.$

The following establishes a necessary and sufficient condition on the
conditional prior $\pi_i(\beta_i \mid \beta_0, \sigma)$ for the Group invariance criterion to hold for this
group.

Result 1. The conditional marginals

(14) $f_i(y \mid \beta_0, \sigma) = \int N_n(y \mid X_0\beta_0 + X_i\beta_i, \sigma^2 I)\, \pi_i(\beta_i \mid \beta_0, \sigma)\, d\beta_i$

are invariant under $G_0$ if and only if $\pi_i(\beta_i \mid \beta_0, \sigma)$ has the form (12).
Proof. See Appendix A.3. □

Based on the Group invariance criterion, Result 1 implies that, conditionally
on the common parameters $\beta_0$ and $\sigma$, $\beta_i$ must be scaled by $\sigma$, centered
at zero and not depend on $\beta_0$ [as was argued for simple normal testing in
Jeffreys (1961)]. Note, in particular, that the robust prior in (8) satisfies the
Group invariance criterion (although it is not the only prior that does so).

Next, since each marginal model $f_i(y \mid \beta_0, \sigma)$ resulting from a prior in (12)
is invariant with respect to $G_0$, the suggestion from Berger, Pericchi and
Varshavsky (1998) is to use the right-Haar density for the common parameters
$(\beta_0, \sigma)$, namely

$\pi_i(\beta_0, \sigma) = \pi^H(\beta_0, \sigma) = \sigma^{-1},$

the right-Haar prior for the location-scale group. Using this, the overall
model prior would be of the form

(15) $\pi_i(\beta_0, \beta_i, \sigma) = \sigma^{-1-k_i} h_i\!\left(\dfrac{\beta_i}{\sigma}\right).$

The justification for the right-Haar prior in Berger, Pericchi and Varshavsky
(1998) depends, however, on showing that it is predictive matching, in the
sense described in the following result.

Result 2. For $M_i$, let the prior $\pi_i(\beta_0, \beta_i, \sigma)$ be of the form (15), where $h_i$
is symmetric about zero. Then all model/prior pairs $\{M_i, \pi_i\}$ are exact
predictive matching for $n^* = k_0 + 1$.

Proof. See Appendix A.4. □

The conclusion of the above development is that the Group invariance
criterion and Predictive matching criterion imply that model selection priors
should be of the form (15), with $h_i$ symmetric about zero. It would thus
appear that the robust prior satisfies these criteria, as (13) is clearly
symmetric about zero. [Any scale mixture of Normals would also satisfy these
criteria, since the resulting $h(\cdot)$ would be symmetric about 0.] Note, however,
that $h_i^R$ has scale matrix proportional to $(V_i^t V_i)^{-1}$, and $V_i$ in (9) requires
both $X_0$ and $X_i$, which would seem to indicate that a sample size of $k_0 + k_i$
is required. Hence, Result 2 would seem to apply to the robust prior only if
$k_i = 1$.

This is a situation, however, where the definition of a minimal sample
size is somewhat ambiguous. For instance, suppose one were presented $X_0$
and $X_i$ for $k_0 + k_i$ observations for each model $M_i$, but that only $k_0 + 1$ of
the $y_i$ were reported for all models, with the rest being missing data. This is
still a minimal sample size in the sense that it is the smallest collection of $y_i$
for which all marginal densities exist for the robust prior, and now Result 2
applies to say that the robust prior is predictive matching for all models.

3.3.2. Justification of the prior for the model specific parameters. While
the robust prior is thus validated as satisfying the group invariance criterion
and a version of the predictive matching criterion, there are many other
model selection priors of form (15) which also satisfy these criteria. There
are additional reasons, however, to focus on the robust priors with $h_i^R(u)$
of form (13). The first is that only scale mixtures of normals seem to have
any possibility of yielding Bayes factors that have closed form. While we
have not focused on this as a necessary criterion, it is an attractive enough
property to justify the restriction. There are, however, two other features
of (13) that need justification: the use of the mixture density $p_i^R(g)$, and the
choice of the conditional scale matrix $(V_i^t V_i)^{-1}$.

The mixture density $p_i^R(g)$ encompasses virtually all of the mixtures that
have been found which can lead to closed form expressions for Bayes factors;
for example, Zellner-Siow priors are scale mixtures of normals, but
with a different mixing density which does not lead to closed form expressions.
[The choice of mixing density in Maruyama and George (2008) is
a very interesting exception, in that it leads to a closed form expression
for a different reason than does $p_i^R(g)$.] So, while not completely definitive,
$p_i^R(g)$ is an attractive choice. The choice of $(V_i^t V_i)^{-1}$ as the conditional scale
matrix seems much more arbitrary, but there is one standard argument and
one surprising argument in its favor.

The standard argument is the measurement invariance criterion; if the
conditional scale matrix is chosen to be $(V_i^t V_i)^{-1}$, it is easy to see that
Bayes factors will be unaffected by changes in the units of measurement of
either $y$ or the model parameters. But there are many other choices of the
conditional scale matrix which also have this property.

A quite surprising predictive matching result that supports use of $(V_i^t V_i)^{-1}$
as the conditional scale matrix is as follows.

Result 3. For $M_i$, let the prior be as in (15), where $h_i$ is the scale
mixture of normals in (13). The priors are then null predictive matching
and dimensional predictive matching for samples of size $k_0 + k_i$, and no
choice of the conditional scale matrix other than $(V_i^t V_i)^{-1}$ (or a multiple)
can achieve this predictive matching.

Proof. See Appendix A.5. □

This is surprising, in that it is a predictive matching result for larger
sample sizes ($k_0 + k_i$) than are encountered in typical predictive matching
results, such as Result 2. That it only holds for conditional scale matrices
proportional to $(V_i^t V_i)^{-1}$ is also surprising, but does strongly support
choosing a prior of the form (8).

3.4. Choosing the hyperparameters for $p_i^R(g)$.

3.4.1. Introduction. The Bayes factor of $M_i$ to $M_0$ arising from the robust
prior $\pi_i^R$ in (8) can be compactly expressed as the following function
of the hyperparameters $a$, $b$ and $\rho_i$:

(16) $B_{i0} = Q_{i0}^{-(n - k_0)/2}\, \dfrac{2a}{k_i + 2a}\, [\rho_i (n + b)]^{-k_i/2}\, AP_i,$

where $AP_i$ is the hypergeometric function of two variables [see Weisstein
(2009)], or Appell hypergeometric function

$AP_i = F_1\!\left[a + \dfrac{k_i}{2};\ \dfrac{k_i + k_0 - n}{2},\ \dfrac{n - k_0}{2};\ a + 1 + \dfrac{k_i}{2};\ \dfrac{b - 1}{\rho_i (b + n)};\ \dfrac{b - Q_{i0}^{-1}}{\rho_i (b + n)}\right],$

and $Q_{i0} = \mathrm{SSE}_i / \mathrm{SSE}_0$ is the ratio of the sum of squared errors of models $M_i$
and $M_0$. The details of this computation are given in Appendix A.6.
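
A numerical sketch of the Bayes factor may help in checking an implementation. The code below does not call a library routine for the Appell function; instead it uses the standard Euler integral representation of $F_1$ (after the substitution $t = s^{1/(a + k_i/2)}$, which removes the end point singularity) and a midpoint rule. The function name, the `steps` argument and the default hyperparameters $a = 1/2$, $b = 1$, $\rho_i = 1/(k_i + k_0)$ are choices made for this illustration.

```python
import math


def robust_bayes_factor(q, n, k0, ki, a=0.5, b=1.0, rho=None, steps=20000):
    """Approximate B_i0 for the robust prior as a function of Q_i0 = SSE_i / SSE_0.

    Uses B_i0 = Q^(-(n-k0)/2) * 2a/(ki+2a) * [rho (b+n)]^(-ki/2) * F1, with F1
    written as the integral over s in (0, 1) of
    (1 - x t)^((n-k0-ki)/2) * (1 - y t)^(-(n-k0)/2), where t = s^(1/(a + ki/2)).
    """
    if rho is None:
        rho = 1.0 / (ki + k0)
    if not (a > 0 and b > 0 and rho >= b / (b + n)):
        raise ValueError("need a > 0, b > 0 and rho >= b / (b + n)")
    if not 0.0 < q <= 1.0:
        raise ValueError("q = SSE_i / SSE_0 must lie in (0, 1]")
    alpha = a + ki / 2.0
    scale = rho * (b + n)
    x = (b - 1.0) / scale
    y = (b - 1.0 / q) / scale
    # Constant factor of the Bayes factor, kept on the log scale for stability.
    log_front = (-0.5 * (n - k0) * math.log(q)
                 + math.log(2.0 * a / (ki + 2.0 * a))
                 - 0.5 * ki * math.log(scale))
    total = 0.0
    for j in range(steps):
        t = ((j + 0.5) / steps) ** (1.0 / alpha)  # midpoint in s, mapped to t
        log_term = (0.5 * (n - k0 - ki) * math.log(1.0 - x * t)
                    - 0.5 * (n - k0) * math.log(1.0 - y * t))
        total += math.exp(log_front + log_term)
    return total / steps


if __name__ == "__main__":
    # Illustrative values: n = 30 observations, intercept only null, two covariates.
    for q in (1.0, 0.6, 0.2):
        print(q, robust_bayes_factor(q, n=30, k0=1, ki=2))
```

When $Q_{i0} = 1$ and $b = 1$ both arguments of $F_1$ vanish, so the Bayes factor reduces to $\frac{2a}{k_i + 2a}[\rho_i(1 + n)]^{-k_i/2}$, which gives a simple check of the constant factor.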

Having a closed form expression for Bayes factors is not one of our formal
criteria for model selection priors, but it is certainly a desirable property,
especially when realizing that one is dealing with $2^p$ models in variable
selection.

The values for the hyperparameters that will be recommended are $a = 1/2$,
$b = 1$ and $\rho_i = (k_i + k_0)^{-1}$. The arguments justifying this specific
recommendation follow.

3.4.2. Implications of the consistency criteria. The consistency criteria
of Section 2.3 provide considerable guidance as to the choice of $a$, $b$ and
the $\rho_i$. In particular, they lead to the following result.

Result 4. The three consistency criteria of Section 2.3 are satisfied
by the robust prior if $a$ and $\rho_i$ do not depend on $n$, $\lim_{n \to \infty} b/n = c \ge 0$,
$\lim_{n \to \infty} \rho_i (b + n) = \infty$ and $n \ge k_i + k_0 + 2a$.

This result follows from (18), (20) and (22) below, which are presented
as separate results because they can be established in more generality than
simply for the robust prior.
Use of model selection consistency. Suppose $M_i$ is the true model, and
consider any other model $M_j$. A key assumption for model selection
consistency [Fernandez, Ley and Steel (2001)] is that, asymptotically, the design
matrices are such that the models are differentiated, in the sense that

$$\lim_{n\to\infty} \frac{\beta_i^t V_i^t (I - P_j) V_i \beta_i}{n} = b_j \in (0,\infty), \tag{17}$$

where $P_j = V_j (V_j^t V_j)^{-1} V_j^t$.

Result 5. Suppose (17) is satisfied and that the priors $\pi_i(\beta_0, \beta_i, \sigma)$ are
of the form (15), with $h_i(u) = \int N_{k_i}(u \mid 0, g(V_i^t V_i)^{-1})\, p_i(g)\, dg$. If the $p_i(g)$
are proper densities such that

$$\lim_{n\to\infty} \int_0^\infty (1+g)^{-k_i/2}\, p_i(g)\, dg = 0,$$

model selection consistency will result.

Proof. The proof follows directly from the proof of Theorem 3 in Liang
et al. (2008) and is, hence, omitted. □

Corollary 1. The prior distributions in (8) are model selection consistent if

$$\lim_{n\to\infty} \rho_i(b+n) = \infty. \tag{18}$$

Proof. See Appendix A.7. □

Use of intrinsic prior consistency. Related to (17) is the condition that

$$\lim_{n\to\infty} \frac{1}{n} V_i^t V_i = \Xi_i \tag{19}$$

for some positive definite matrix $\Xi_i$. This would trivially happen either if
there is a fixed design with replicates, or if the covariates arise randomly
from a fixed distribution having second moments.

Result 6. If (19) holds,

$$a \text{ and } \rho_i \text{ do not depend on } n \quad\text{and}\quad \frac{b}{n} \to c, \tag{20}$$

then the conditional robust prior $\pi_i^R(\beta_i \mid \beta_0, \sigma)$ in (8) converges to the fixed
intrinsic prior

$$\pi_i(\beta_i \mid \beta_0, \sigma) = \int_0^\infty N_{k_i}(\beta_i \mid 0,\ g^* \sigma^2 \Xi_i^{-1})\, p_i(g^*)\, dg^*, \tag{21}$$

where $p_i(g^*) = a[\rho_i(c+1)]^a (g^*+c)^{-(a+1)}\, 1_{\{g^* > \rho_i(c+1) - c\}}$.

Proof. Changing variables to $g^* = g/n$, the integral in (8) becomes

$$\int_0^\infty N_{k_i}\!\Big(\beta_i \,\Big|\, 0,\ g^*\sigma^2\big(\tfrac{1}{n}V_i^tV_i\big)^{-1}\Big)\, a\Big[\rho_i\Big(\frac{b}{n}+1\Big)\Big]^a \Big(g^*+\frac{b}{n}\Big)^{-(a+1)} 1_{\{g^* > \rho_i(b/n+1) - b/n\}}\, dg^*.$$

For large $n$ and using (19) and (20), it is easy to find an integrable function
dominating the integrand, so the dominated convergence theorem can be
applied to interchange the integral and limit, yielding the result. □

Use of information consistency. For the variable selection problem, it is
easy to see that

$$\sup_{\beta_0,\beta_i,\sigma} f_i(y \mid \beta_0, \beta_i, \sigma) = (2\pi\,\mathrm{SSE}_i/n)^{-n/2} \exp(-n/2)$$

for model $M_i$. Hence, for any given data set $y$, the estimated likelihood ratio
in (4) is

$$\Lambda_{i0}(y) = Q_{i0}(y)^{-n/2},$$

where $Q_{i0}(y)$ is the ratio of the residual sum of squares of the two models
for $y$. Therefore, having a sequence of data vectors $\{y_m\}$ such that
$\lim_{m\to\infty} \Lambda_{i0}(y_m) = \infty$ is equivalent to having a sequence of data vectors
such that $\lim_{m\to\infty} Q_{i0}(y_m) = 0$.

Result 7. If $\rho_i \ge b/(b+n)$, the prior in (8) results in an information
consistent Bayes factor for $M_i$ versus $M_0$ if and only if

$$n \ge k_i + k_0 + 2a. \tag{22}$$

Proof. See Appendix A.8. □

3.4.3. Specific choices of hyperparameters.

The choice of $a$. Note that, with $k_i > k_j$ and $n \ge k_i + k_0 + 1$, the Bayes
factor $B_{ij}$ between $M_i$ and $M_j$ exists. It is desirable to have information
consistency for all such sample sizes, in which case (22) would require
$a \le 1/2$. The choice $a = 1/2$ is attractive, in that it coincides with the choice in
Berger (1985) and, with this choice, $\pi_i^R$ has Cauchy tails, as do the popular
proposals of Jeffreys (1961) and Zellner and Siow (1980, 1984).

Additional motivation for this choice can be found by studying the behavior
of $B_{i0}$ when the information favors $M_0$, in the sense that $Q_{i0} \to 1$.
Indeed, Forte (2011) shows that the limiting value of $B_{i0}$ is then bounded
above by $2a/(2a + k_i)$ for any sample size, including a small sample size such
as $k_0 + k_i + 1$. A small value of $a$ would imply strong evidence in favor of $M_0$,
which does not seem reasonable when the sample size is small. In contrast,
the recommended choice would yield a bound of $1/(1 + k_i)$, which certainly
favors $M_0$, but in a sensibly modest fashion when the sample size is small.

The choice of $b$. To understand the effect of $b$ and the $\rho_i$ on the robust
prior, it is useful to begin by considering the approximating intrinsic prior
in Result 6, which depends on the hyperparameters only through the mixing
distribution $p_i(g^*)$, which for $a = 1/2$ is given by (when $b/n \to c$)

$$p_i(g^*) = \tfrac{1}{2}[\rho_i(c+1)]^{1/2}(g^*+c)^{-3/2}\, 1_{\{g^* > \rho_i(c+1) - c\}}. \tag{23}$$

This is a very flat-tailed distribution with median $4\rho_i(1 + c) - c$. Because it is
so flat tailed, the choice of $c$ in $(g^* + c)^{-3/2}$ is not particularly influential, so
that the main issue is the choice of the median. For selecting a median, however,
$\rho_i$ and $c$ are confounded; that is, we do not need both. For simplicity,
therefore, we will choose $c = 0$ (i.e., $b$ such that $b/n \to 0$).
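The stated median can be checked directly, since integrating (23) from the left end of its support gives the closed-form distribution function $1 - [\rho_i(c+1)/(g^*+c)]^{1/2}$. The following is a minimal sketch of that check; the function names and the numerical values of `rho` and `c` are illustrative assumptions, not part of the paper.

```python
import math

def intrinsic_mixing_cdf(x, rho, c):
    """CDF of the a = 1/2 mixing density in (23): P(g* <= x)."""
    lower = rho * (c + 1.0) - c  # left end of the support of g*
    if x <= lower:
        return 0.0
    return 1.0 - math.sqrt(rho * (c + 1.0) / (x + c))

def intrinsic_mixing_median(rho, c):
    """Median 4*rho*(1 + c) - c quoted in the text."""
    return 4.0 * rho * (1.0 + c) - c

# Illustrative values only: the CDF evaluated at the median should be 1/2.
for c in (0.0, 0.5, 2.0):
    m = intrinsic_mixing_median(rho=0.25, c=c)
    print(c, m, intrinsic_mixing_cdf(m, rho=0.25, c=c))
```

With $c = 0$ the median reduces to $4\rho_i$, which is why only one of $\rho_i$ and $c$ is needed to position the prior.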

If $b/n \to c = 0$, the intrinsic prior does not depend at all on $b$. Furthermore,
there is very little dependence on $b$, in this case, for the actual robust prior,
as was verified for moderate and small $n$ in Forte (2011) through an extensive
numerical study.

Since any choice of $b$ for which $b/n \to 0$ makes little difference, it would be
reasonable to make such a choice based on pragmatic considerations. In this
regard, note that the choice $b = 1$ has a notable computational advantage,
in that the hypergeometric function of two variables, $\mathrm{AP}_i$, then becomes the
standard hypergeometric function of one variable [Abramowitz and Stegun
(1964)]. We thus choose $b = 1$.

The choice of $\rho_i$. This is the most difficult choice to make, since there is
only limited guidance from the various criteria. To review (and assuming
$b = 1$), we have that $\rho_i \ge 1/(1 + n)$ (so that $g > 0$); $\lim_{n\to\infty} \rho_i(1 + n) = \infty$
(for model selection consistency); and $\rho_i$ should not depend on $n$ (for there
to be a limiting intrinsic prior). Also note that $n$ is necessarily greater than
or equal to $k_0 + k_i$ for the robust prior and marginal likelihood to exist;
supposing we wish to choose $\rho_i$ so that the conditions are satisfied for all
such $n$, these restrictions only imply that

$\rho_i$ must be a constant (independent of $n$) and $\rho_i \ge 1/(1 + k_0 + k_i)$.

We present two arguments below for the specific choice $\rho_i = 1/(k_0 + k_i)$.

Argument 1. Consider the Bayes factor $B_{i0}$ of $M_i$ to $M_0$. In Result 3,
it was established that $B_{i0} = 1$ for a sample of size $n = k_i + k_0$, but a natural
question is: what should we expect for a sample of size $n = k_i + k_0 + 1$? Can
a single additional observation provide much information to discriminate
between $M_i$ and $M_0$? Intuition says no. To quantify the intuition, consider
the situation in which $Q_{i0} \to 1$, which corresponds to information being
as supportive as possible of $M_0$. It is straightforward to show that, when
$n = k_i + k_0 + 1$,

$$\lim_{Q_{i0}\to 1} B_{i0} = \frac{1}{k_i+1}\,[\rho_i(k_i + k_0 + 2)]^{-k_i/2}. \tag{24}$$

As we should not expect a single extra observation to provide very strong
evidence, even in the case that $Q_{i0} \to 1$, the implication is that we should
choose $\rho_i$ to be as small as is reasonable. The choice $\rho_i = 1/(k_0 + k_i + 1)$ is
the minimum value of $\rho_i$ and is, hence, certainly a candidate.
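To get a feel for how much the limit in (24) depends on $\rho_i$, one can tabulate it for the minimum value $1/(k_0 + k_i + 1)$ and for the recommended value $1/(k_0 + k_i)$. This is only an illustrative sketch: the function name and the choice $k_0 = 1$ are assumptions made for the example.

```python
def limiting_bayes_factor(k_i, k_0, rho):
    """Right-hand side of (24): limit of B_i0 as Q_i0 -> 1 when n = k_i + k_0 + 1
    (with a = 1/2 and b = 1)."""
    return (rho * (k_i + k_0 + 2)) ** (-k_i / 2.0) / (k_i + 1)

k_0 = 1  # illustrative: an intercept-only null model
for k_i in (1, 2, 5):
    at_minimum = limiting_bayes_factor(k_i, k_0, 1.0 / (k_0 + k_i + 1))
    at_recommended = limiting_bayes_factor(k_i, k_0, 1.0 / (k_0 + k_i))
    print(k_i, round(at_minimum, 4), round(at_recommended, 4))
```

Both choices keep the limit below the bound $1/(1 + k_i)$ discussed for the choice of $a$, and the two columns differ only modestly.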

Argument 2. Consider the intrinsic prior defined by (21) and (23).
Note that we have chosen $c = 0$ (through the choice of $b = 1$) and, after
making the transformation $g = g^*/\rho_i$, the intrinsic prior can be written

$$\pi_i(\beta_0, \beta_i, \sigma) = \sigma^{-1} \times \int_0^\infty N_{k_i}(\beta_i \mid 0,\ g\rho_i\sigma^2\Xi_i^{-1})\, p_i(g)\, dg, \tag{25}$$

where $p_i(g) = (1/2)\, g^{-3/2}\, 1_{\{g > 1\}}$. Thus we see that, in the intrinsic prior
approximation to the robust prior, $\rho_i$ can be interpreted as simply a scale factor
to the conditional covariance matrix. This helps, in that there have been
previous suggestions related to "unit information priors" [Kass and Wasserman
(1995), Berger, Bayarri and Pericchi (2012)]. For instance, Berger, Bayarri
and Pericchi (2012) consider the group means problem defined as follows:
the observations are

$$y_{ij} = \mu_i + \varepsilon_{ij}, \qquad i = 1,\ldots,k \ \text{ and } \ j = 1,\ldots,r,$$

with i.i.d. $\varepsilon_{ij} \sim N(\cdot \mid 0, \sigma^2)$. Thus there are $k$ different means, $\mu_i$, and $r$
replicate observations for each. Applying the robust prior to this example
(considering the full model with all $\mu_i$) results in a conditional covariance
matrix in (25) proportional to $\rho k I$, which is much too diffuse if $k$ is large and
$\rho$ is not small. Selecting $\rho = 1/k$, on the other hand, restores a "unit information" prior.
Here $k_0 = 0$, so the choice $\rho = 1/k$ is equivalent to the overall choice
$\rho_i = 1/(k_0 + k_i)$. This overall choice is obviously very close to the earlier
suggested $1/(k_0 + k_i + 1)$.

3.5. Two simpler cases. We conclude with discussion of the modifications
of the robust prior that are needed when $\beta_0 = 0$ or when $\sigma$ is known.

3.5.1. When $\beta_0 = 0$ and $\sigma$ is unknown. When $\beta_0 = 0$, the robust prior
distribution is

$$\pi_i^R(\beta_i, \sigma) = \pi(\sigma) \times \pi_i^R(\beta_i \mid \sigma) = \sigma^{-1} \times \int_0^\infty N_{k_i}(\beta_i \mid 0, g\Sigma_i)\, p_i^R(g)\, dg,$$

where $\Sigma_i = \mathrm{Cov}(\hat\beta_i) = \sigma^2(X_i^t X_i)^{-1}$, the covariance of the maximum likelihood
estimator of $\beta_i$ and, as before,

$$p_i^R(g) = a[\rho_i(b + n)]^a (g + b)^{-(a+1)}, \qquad g > \rho_i(b + n) - b.$$

The corresponding Bayes factor is as in (16) with $k_0 = 0$; when we choose
$a = 1/2$, $b = 1$ and $\rho_i = 1/(k_i + k_0)$, it assumes the simpler form in (26),
again with $k_0 = 0$.

With regard to the group invariance criterion, when $\beta_0 = 0$ the models are
invariant under the scale group of transformations, $G_0 = \{y \to cy,\ c > 0\}$,
and it is easy to show that $\pi_i(\beta_i \mid \beta_0, \sigma)$ still needs to be a scale prior, as
in (12), to preserve the invariance structure; also, the use of $\pi(\sigma) = 1/\sigma$ is
again justified by predictive matching, as it is the Haar prior for the group.
Null and dimensional predictive matching also hold, as do the various
consistency criteria.

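3.5.2. When $\sigma$ is known and $\beta_0 \ne 0$. When $\sigma$ is known, the robust prior
becomes

$$\pi_i^R(\beta_i, \beta_0, \sigma) = \pi(\beta_0) \times \pi_i^R(\beta_i \mid \beta_0) \propto \int_0^\infty N_{k_i}(\beta_i \mid 0, g\Sigma_i)\, p_i^R(g)\, dg,$$

where $\Sigma_i = \mathrm{Cov}(\hat\beta_i) = \sigma^2(V_i^t V_i)^{-1}$, and $p_i^R(g)$ is as before.

The models are now invariant under the location group $G_0 = \{y \to y + X_0 b,\ b \in \mathbb{R}^{k_0}\}$,
and it is easy to show that $\pi(\beta_i \mid \beta_0)$ just needs to be independent
of $\beta_0$ to preserve the invariance structure; the use of the Haar prior
$\pi(\beta_0) = 1$ is again justified through predictive matching arguments.

The Bayes factor can be expressed as

$$B_{i0} = \int_0^\infty (g + 1)^{-k_i/2}\, \Lambda_{0i}^{(1/(g+1)) - 1}\, p_i^R(g)\, dg,$$

where $\Lambda_{0i} = \exp(-[\mathrm{SSE}_0 - \mathrm{SSE}_i]/(2\sigma^2))$. This is curiously difficult to express
in closed form in general but, for our preferred choice $b = 1$, change of
variables to $h = 1/(1 + g)$ yields

$$\begin{aligned}
B_{i0} &= \int_0^\infty (g+1)^{-k_i/2}\, \Lambda_{0i}^{(1/(g+1))-1}\, a(\rho_i(1+n))^a (g+1)^{-(a+1)}\, 1_{\{g > \rho_i(1+n) - 1\}}\, dg \\
&= a(\rho_i(1+n))^a\, \Lambda_{0i}^{-1} \int_0^{1/[\rho_i(1+n)]} h^{(a-1+k_i/2)}\, e^{-h[\mathrm{SSE}_0 - \mathrm{SSE}_i]/(2\sigma^2)}\, dh \\
&= a(\rho_i(1+n))^a\, \Lambda_{0i}^{-1} \Big(\frac{[\mathrm{SSE}_0 - \mathrm{SSE}_i]}{2\sigma^2}\Big)^{-(a+k_i/2)} \Big[\Gamma\Big(a + \frac{k_i}{2}\Big) - \Gamma\Big(a + \frac{k_i}{2},\ \frac{[\mathrm{SSE}_0 - \mathrm{SSE}_i]}{2\sigma^2\rho_i(1+n)}\Big)\Big],
\end{aligned}$$

where $\Gamma(\nu_1, \nu_2)$ is the incomplete gamma function,

$$\Gamma(\nu_1, \nu_2) = \int_{\nu_2}^\infty t^{\nu_1 - 1} e^{-t}\, dt.$$

All of the properties of the procedures for the $\sigma$ unknown case also hold
here, except for null predictive matching.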
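When an incomplete gamma routine is not at hand, the one-dimensional integral in $h$ obtained after the change of variables can be evaluated numerically. The sketch below uses composite Simpson's rule and the standard library only; the function name, argument names and the default `steps` are assumptions made for illustration, and it requires $a + k_i/2 \ge 1$ (true for the recommended $a = 1/2$) so that the integrand is finite at $h = 0$.

```python
import math

def bf_sigma_known(sse0, sse_i, sigma2, n, k_i, rho, a=0.5, steps=2000):
    """B_i0 for known sigma^2 with b = 1, by Simpson's rule on the h-integral.

    Assumes a + k_i/2 >= 1 and an even number of steps.
    """
    d = (sse0 - sse_i) / (2.0 * sigma2)  # Lambda_0i = exp(-d)
    scale = rho * (1.0 + n)              # rho_i * (1 + n)
    upper = 1.0 / scale                  # upper limit of the h-integral

    def integrand(h):
        return h ** (a - 1.0 + k_i / 2.0) * math.exp(-h * d)

    width = upper / steps
    total = integrand(0.0) + integrand(upper)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * integrand(j * width)
    integral = total * width / 3.0
    # a * (rho_i (1 + n))^a * Lambda_0i^{-1} * integral
    return a * scale ** a * math.exp(d) * integral

# Illustrative call with made-up sums of squares.
print(bf_sigma_known(sse0=12.0, sse_i=8.0, sigma2=1.0, n=10, k_i=1, rho=0.5))
```

For $k_i = 1$ and $a = 1/2$ the integrand is a pure exponential, so the result can be compared with the elementary closed form $a(\rho_i(1+n))^a \Lambda_{0i}^{-1}(1 - e^{-dU})/d$, where $d = [\mathrm{SSE}_0 - \mathrm{SSE}_i]/(2\sigma^2)$ and $U = 1/[\rho_i(1+n)]$.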






4. Methodological summary for variable selection. Although the pri- 
mary purpose of the paper was to develop the criteria for choice of model 
selection priors and study their implementation in an example, the method- 
ological results obtained for the problem of variable selection in the normal 
linear model, as outlined in Section 3.1, are of interest in their own right. 
For ease of use, we summarize these developments here. 

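Using the notation of Section 3.1, the prior distribution recommended for
the parameters under model $M_i$ is

$$\pi_i^R(\beta_0, \beta_i, \sigma) = \sigma^{-1} \times \int_0^\infty N_{k_i}(\beta_i \mid 0, g\Sigma_i)\, p_i^R(g)\, dg,$$

where $\Sigma_i = \sigma^2(V_i^t V_i)^{-1}$, $V_i = (I_n - X_0(X_0^t X_0)^{-1}X_0^t)X_i$ and

$$p_i^R(g) = \frac{1}{2}\Big[\frac{1+n}{k_i+k_0}\Big]^{1/2} (g+1)^{-3/2}\, 1_{\{g > (k_i+k_0)^{-1}(1+n) - 1\}}.$$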

The resulting Bayes factors have closed form expressions in terms of
the hypergeometric function, namely

$$B_{i0} = \Big[\frac{n+1}{k_i+k_0}\Big]^{-k_i/2} \times \frac{Q_{i0}^{-(n-k_0)/2}}{k_i+1}\; {}_2F_1\Big[\frac{k_i+1}{2};\ \frac{n-k_0}{2};\ \frac{k_i+3}{2};\ \frac{(1 - Q_{i0}^{-1})(k_i+k_0)}{1+n}\Big], \tag{26}$$

where ${}_2F_1$ is the standard hypergeometric function [see Abramowitz and
Stegun (1964)], and $Q_{i0} = \mathrm{SSE}_i/\mathrm{SSE}_0$ is the ratio of the sum of squared
errors of models $M_i$ and $M_0$.
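As a rough illustration of how (26) might be evaluated, the sketch below uses Euler's integral representation of ${}_2F_1$, which applies here because the third parameter exceeds the first by one, so that ${}_2F_1 = \frac{k_i+1}{2}\int_0^1 \lambda^{(k_i-1)/2}(1 - z\lambda)^{-(n-k_0)/2}\,d\lambda$ with $z$ the last argument in (26). The integral is computed by composite Simpson's rule after the substitution $\lambda = u^2$; the function name and the `steps` default are assumptions made for the example, and a library hypergeometric routine could be substituted.

```python
def robust_bayes_factor(q_i0, n, k_i, k_0, steps=2000):
    """B_i0 in (26) for a = 1/2, b = 1, rho_i = 1/(k_i + k_0).

    q_i0 = SSE_i / SSE_0, assumed to lie in (0, 1]; steps must be even.
    """
    rho_n = (n + 1.0) / (k_i + k_0)   # rho_i * (1 + n)
    z = (1.0 - 1.0 / q_i0) / rho_n    # last argument of 2F1 in (26), z <= 0

    def integrand(u):
        # lambda = u^2 turns lambda^{(k_i - 1)/2} d(lambda) into 2 u^{k_i} du
        return 2.0 * u ** k_i * (1.0 - z * u * u) ** (-(n - k_0) / 2.0)

    width = 1.0 / steps
    total = integrand(0.0) + integrand(1.0)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * integrand(j * width)
    integral = total * width / 3.0
    # (1/(k_i+1)) * 2F1 equals one half of the integral above
    return q_i0 ** (-(n - k_0) / 2.0) * rho_n ** (-k_i / 2.0) * 0.5 * integral

# Illustrative values: strong versus weak reduction in the sum of squares.
print(robust_bayes_factor(0.1, 20, 2, 1), robust_bayes_factor(0.9, 20, 2, 1))
```

Setting `q_i0 = 1` with `n = k_i + k_0 + 1` reproduces the limit in (24), which gives a convenient sanity check.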

To implement Bayesian model selection through (2), one also needs the
prior odds ratios $P_{j0}$. A recommended objective Bayesian choice of these
odds ratios for the variable selection problem is $P_{j0} = k_j!\,(p - k_j)!/p!$. For
extensive discussion and earlier references, see Scott and Berger (2010).
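The prior odds above are simply the reciprocal of the number of models of size $k_j$, so they are easy to combine with Bayes factors against the null model. The following sketch assumes that posterior model probabilities are obtained by normalizing the products $P_{j0}B_{j0}$ over all models considered; the function names and the toy Bayes factors are illustrative assumptions.

```python
import math

def prior_odds(k_j, p):
    """P_j0 = k_j! (p - k_j)! / p!, i.e. 1 / C(p, k_j)."""
    return 1.0 / math.comb(p, k_j)

def posterior_probabilities(bayes_factors, sizes, p):
    """Posterior model probabilities from Bayes factors B_j0 against the null.

    bayes_factors[j] and sizes[j] describe model j; the null model
    (size 0, Bayes factor 1) should be included in both lists.
    """
    weights = [prior_odds(k, p) * b for b, k in zip(bayes_factors, sizes)]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example with p = 2 candidate covariates and made-up Bayes factors.
print(posterior_probabilities([1.0, 3.0, 3.0, 6.0], [0, 1, 1, 2], p=2))
```

Because the odds decrease with the number of models sharing a given size, models of intermediate size are penalized most heavily when $p$ is large.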

APPENDIX 

A.1. Approximations to improper priors.

Lemma 1. Consider $\pi_i(\alpha) = c_i\psi_i(\alpha)$, where $\psi_i(\alpha)$ increases monotonically
in $i$ to $\pi(\alpha)$ and $c_i = 1/\int \psi_i(\alpha)\, d\alpha < \infty$. Then, if $\int f_l(y \mid \alpha)\pi(\alpha)\, d\alpha < \infty$
for all densities $f_l(y \mid \alpha)$,

$$\lim_{i\to\infty} \frac{\int f_l(y \mid \alpha)\pi_i(\alpha)\, d\alpha}{\int f_{l'}(y \mid \alpha)\pi_i(\alpha)\, d\alpha} = \frac{\int f_l(y \mid \alpha)\pi(\alpha)\, d\alpha}{\int f_{l'}(y \mid \alpha)\pi(\alpha)\, d\alpha}.$$

Proof.

$$\frac{\int f_l(y \mid \alpha)\pi_i(\alpha)\, d\alpha}{\int f_{l'}(y \mid \alpha)\pi_i(\alpha)\, d\alpha} = \frac{\int f_l(y \mid \alpha)\psi_i(\alpha)\, d\alpha}{\int f_{l'}(y \mid \alpha)\psi_i(\alpha)\, d\alpha} \longrightarrow \frac{\int f_l(y \mid \alpha)\pi(\alpha)\, d\alpha}{\int f_{l'}(y \mid \alpha)\pi(\alpha)\, d\alpha}$$

by the monotone convergence theorem. □

Thus common proper priors can be used to approximate common improper
priors and, as the approximation improves, the Bayes factors for the
proper priors converge to the Bayes factor for the improper prior; this is why
Bayesians have always said that it is not illogical to use an improper prior
for a common parameter $\alpha$ in computing a Bayes factor. It is interesting
that no conditions are needed in the lemma, except that the marginal
likelihoods exist for the improper prior, which is clearly needed for the Bayes
factor to even be defined for the improper prior.

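A.2. Proof of Proposition 1. This proof requires the following lemma:

Lemma 2. If $m > 1$, $p > 0$, $a > 0$ and $k > 1$, then

$$\lim_{z\to\infty} z^{a+k} \int_0^1 \lambda^{a-1} \Big(\frac{\lambda}{m-\lambda}\Big)^k e^{-(\lambda/(m-\lambda))pz}\, d\lambda = m^a p^{-(a+k)}\, \Gamma(a+k).$$

Proof. For $0 < \varepsilon < 1$, write

$$\begin{aligned}
\lim_{z\to\infty} z^{a+k} \int_0^1 \lambda^{a-1} \Big(\frac{\lambda}{m-\lambda}\Big)^k e^{-(\lambda/(m-\lambda))pz}\, d\lambda
&= \lim_{z\to\infty} \int_0^\varepsilon z^{a+k}\lambda^{a-1} \Big(\frac{\lambda}{m-\lambda}\Big)^k e^{-(\lambda/(m-\lambda))pz}\, d\lambda \\
&\quad + \lim_{z\to\infty} \int_\varepsilon^1 z^{a+k}\lambda^{a-1} \Big(\frac{\lambda}{m-\lambda}\Big)^k e^{-(\lambda/(m-\lambda))pz}\, d\lambda.
\end{aligned} \tag{27}$$

Note that

$$\lim_{z\to\infty} z^{a+k}\lambda^{a-1} \Big(\frac{\lambda}{m-\lambda}\Big)^k e^{-(\lambda/(m-\lambda))pz} = 0,$$

and the integrand in the last integral in (27) is uniformly bounded over $\lambda$
and $z$. It follows from the dominated convergence theorem that the last term
is zero, so that

$$\lim_{z\to\infty} \int_0^1 z^{a+k}\lambda^{a-1} \Big(\frac{\lambda}{m-\lambda}\Big)^k e^{-(\lambda/(m-\lambda))pz}\, d\lambda = \lim_{z\to\infty} \int_0^\varepsilon z^{a+k}\lambda^{a-1} \Big(\frac{\lambda}{m-\lambda}\Big)^k e^{-(\lambda/(m-\lambda))pz}\, d\lambda. \tag{28}$$

Next, make the change of variables $t = \lambda/(m - \lambda)$ to get

$$\int_0^\varepsilon \lambda^{a-1} \Big(\frac{\lambda}{m-\lambda}\Big)^k e^{-(\lambda/(m-\lambda))pz}\, d\lambda = m^a \int_0^{\varepsilon/(m-\varepsilon)} \frac{t^{k+a-1}}{(1+t)^{a+1}}\, e^{-tpz}\, dt. \tag{29}$$

To bound the integral of interest notice that, for $t \in (0, \varepsilon/(m - \varepsilon))$,

$$\frac{1}{(1 + \varepsilon/(m-\varepsilon))^{a+1}} \le \frac{1}{(1+t)^{a+1}} \le 1.$$

By integrating $t$ out from (29) and multiplying the result by $z^{a+k}$, we get
both an upper and a lower bound for the integral of interest, namely

$$\frac{m^a p^{-(a+k)}\big(\Gamma(a+k) - \Gamma(a+k, (\varepsilon/(m-\varepsilon))pz)\big)}{(1 + \varepsilon/(m-\varepsilon))^{a+1}} \le z^{a+k}\, m^a \int_0^{\varepsilon/(m-\varepsilon)} \frac{t^{k+a-1}}{(1+t)^{a+1}}\, e^{-tpz}\, dt \le m^a p^{-(a+k)}\Big(\Gamma(a+k) - \Gamma\Big(a+k, \frac{\varepsilon pz}{m-\varepsilon}\Big)\Big), \tag{30}$$

where $\Gamma(\nu_1, \nu_2)$ is the incomplete gamma function,

$$\Gamma(\nu_1, \nu_2) = \int_{\nu_2}^\infty t^{\nu_1-1} e^{-t}\, dt,$$

which goes to zero as $\nu_2$ goes to infinity.
Taking limits in (30) as $z \to \infty$ gives

$$\frac{m^a p^{-(a+k)}\, \Gamma(a+k)}{(1 + \varepsilon/(m-\varepsilon))^{a+1}} \le \lim_{z\to\infty} z^{a+k}\, m^a \int_0^{\varepsilon/(m-\varepsilon)} \frac{t^{k+a-1}}{(1+t)^{a+1}}\, e^{-tpz}\, dt \le m^a p^{-(a+k)}\, \Gamma(a+k).$$

The result follows from (28) and the fact that the upper and lower bound
are equal as $\varepsilon$ goes to 0. □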

Continuing with the proof of Proposition 1, we remove the subindex i for 
simplicity in notation. Since the multivariate Student density can be written 
as 

St,{f3 I 0,C*,2a) = Ii^±M(2vr)"^-/2((„r(„))i/Vp(6 + n))-'=/2 

r(a) 

X |V*V|^/2(l + (2(ar(a))'/V'(^' + n))"'ll/?f)"^"^''^'\ 
it can be easily shown that 

Stk{P\0,C*,2a) 

||/3||™oo r(a + A;/2)(27r)-fc/2a(a2p(& + 7z))«| V*V|i/22»+fc/2(||/3||2)-(a+fc/2) 
2(ar(a))^/V2(5 + n)



>oo 

It then follows that 



-(a+k/2) 



X hm l + (2K(a))VVa^^(fc + n))-^U„ ^ 

llfll|2_ 



lim '^"^""•^ 



oo 



T{a + k/2){p{b + n))<' 

||/3||2^oo"' Jo \m-Xj 

where $m = (\rho(b + n))/b$ and $p = 1/(2\sigma^2 b)$. Since $\rho > b/(b + n)$ and $m > 1$, we
can apply Lemma 2, and the result follows.

A.3. Proof of Result 1. To apply invariance, let $\theta = (\beta_0, \sigma, \beta_i, i)$ denote
the parameter indexing all the models, and consider the location-scale
group defined by $g = (c, b) \in G_0 = (0, \infty) \times \mathbb{R}^{k_0}$ acting on $y$ through the
transformation $y^* = cy + X_0 b$. It can be easily seen that $y^* \sim f(\cdot \mid \theta^*)$, where
$\theta^* = (\beta_0^*, \sigma^*, \beta_i^*, i^*)$ with $\beta_0^* = b + c\beta_0$, $\sigma^* = c\sigma$, $\beta_i^* = c\beta_i$ and $i^* = i$, so
that the transformed model has exactly the same structure as the original
model. The Invariance-criterion thus says that the prior $\pi_i(\beta_i \mid \beta_0, \sigma)$ must
be such that the marginal models in (5) are invariant with respect to the
group action, so that (keeping to the notation above)

$$f(y^* \mid \beta_0^*, \sigma^*, i^*) = \int N_n(y^* \mid X_0\beta_0^* + X_i\beta_i^*, (\sigma^*)^2 I)\, \pi_i(\beta_i^* \mid \beta_0^*, \sigma^*)\, d\beta_i^*,$$

the fact that $\pi_i(\cdot \mid \cdot, \cdot)$ must have the same functional form as in the original
parameterization following from the completeness of
$N_n(y^* \mid X_0\beta_0^* + X_i\beta_i^*, (\sigma^*)^2 I)$, given that the design matrix is of full rank. But one can also
compute $f(y^* \mid \beta_0^*, \sigma^*, i^*)$ by change of variables from the original density,
yielding

$$f(y^* \mid \beta_0^*, \sigma^*, i^*) = \int N_n(y^* \mid X_0\beta_0^* + X_i\beta_i^*, (\sigma^*)^2 I)\, \pi_i(\beta_i^*/c \mid (\beta_0^* - b)/c, \sigma^*/c)\, c^{-k_i}\, d\beta_i^*.$$

Again using the completeness of the normal density, these two expressions
can be equal only if

$$\pi_i(\beta_i^* \mid \beta_0^*, \sigma^*) = \pi_i(\beta_i^*/c \mid (\beta_0^* - b)/c, \sigma^*/c)\, c^{-k_i}.$$

This condition is satisfied by the conditional prior in (12).
With respect to the "only if" part of the proof, note that for the particular
transformation in $G_0$ given by $b = \beta_0^*$ and $c = \sigma^*$, the above condition
becomes

$$\pi_i(\beta_i^* \mid \beta_0^*, \sigma^*) = (\sigma^*)^{-k_i}\, \pi_i(\beta_i^*/\sigma^* \mid 0, 1),$$

proving that being of the form in (12) is also a necessary condition.

A.4. Proof of Result 2. With the use of the full conditional for $\beta_i$ associated
with this prior, the integrated models can be alternatively expressed
as

$$M_i^I : Y^* = X_0\beta_0 + \sigma\varepsilon,$$

where $\varepsilon \sim f_i^I(u)$, given by

$$f_i^I(u) = \int N_n(u \mid X_i t, I)\, h_i(t)\, dt \quad (i > 0) \qquad\text{and}\qquad f_0^I(u) = N_n(u \mid 0, I).$$

This model selection problem was explicitly studied in Berger, Pericchi and
Varshavsky (1998), where it was shown that the minimal sample size associated
with the right-Haar prior for $(\beta_0, \sigma)$ is $n^* = k_0 + 1$, and that it is
sufficient for exact predictive matching for $f_i^I(\cdot)$ [or, equivalently, $h_i(\cdot)$] to
be symmetric about the origin.

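A.5. Proof of Result 3. It is convenient to work in terms of orthogonal
parameters, so, for each model $M_i$, define $\gamma = \beta_0 + (X_0^t X_0)^{-1}X_0^t X_i\beta_i$; this
will be "common" to all models and orthogonal to $\beta_i$ in each model $M_i$,
which can be written in the new parameterization as
$y \sim N_n(y \mid X_0\gamma + V_i\beta_i, \sigma^2 I_n)$. Consider a scale mixture of normals prior of the form

$$\pi_i(\beta_i \mid \gamma, \sigma) = \pi(\beta_i \mid \sigma) = \int_0^\infty N_{k_i}(\beta_i \mid 0, g\sigma^2 A_i)\, h(g)\, dg.$$

Noting that the right-Haar prior for $(\beta_0, \sigma)$ transforms into the same prior
$(1/\sigma)$ for $(\gamma, \sigma)$, it follows that the marginal likelihood under model $M_i$ is

$$\begin{aligned}
m_i(y) &= \int N_n(y \mid X_0\gamma + V_i\beta_i, \sigma^2 I_n)\, \sigma^{-1}\, \pi_i(\beta_i \mid \gamma, \sigma)\, d(\beta_i, \gamma, \sigma) \\
&= \int_0^\infty\!\!\int N_n(y \mid X_0\gamma + V_i\beta_i, \sigma^2 I_n)\, \sigma^{-1}\, N_{k_i}(\beta_i \mid 0, g\sigma^2 A_i)\, h(g)\, d(\beta_i, \gamma, \sigma)\, dg.
\end{aligned}$$

Using the fact that $y^t V_i(V_i^t V_i)^{-1}V_i^t y = \mathrm{SSE}_0$ for any sample of size
$n = k_i + k_0$ and integrating out $\gamma$, $\beta_i$ and $\sigma$ yields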

x{Pl[iVivr'+9A^r'^r''^'T(^^yig)d{g). 
For the robust prior, $A_i = (V_i^t V_i)^{-1}$, and it follows that




which is the same for all models of dimension $k_i$, establishing that the robust
prior is dimension predictive matching for sample sizes $k_0 + k_i$. Furthermore,
this last expression equals $m_0(y)$ (see Appendix A.6), establishing that the
robust prior is null predictive matching for samples of size $k_0 + k_i$. [Note
that this result would hold for any proper choice of $h(g)$, not just that for
the robust prior.]

To see that null predictive matching does not occur if $A_i$ is not a multiple
of $(V_i^t V_i)^{-1}$, note that the expression to be established for null predictive
matching is (eliminating multiplicative constants)



Since $(V_i^t V_i)^{-1}$ and $A_i$ are positive definite, there is a matrix $B$ such that
$B^t(V_i^t V_i)^{-1}B = I$ and $B^t A_i B = D$, with $D$ being a diagonal matrix with
diagonal elements $d_i$. Also defining $W = B^t\beta_i$, it follows that the above
expression can be written



Let $d_j$ be the largest diagonal element, and choose $W$ to be the unit vector
in coordinate $j$. Then the above expression becomes



But the integrand is clearly greater than 0, unless all $d_i$ are equal, which is
equivalent to the statement that $A_i$ is a multiple of $(V_i^t V_i)^{-1}$.

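A.6. Computation of the Bayes factor in (16).

Proposition 2. For any $(a, b, \rho_i)$ satisfying (11) and $n \ge k_i + k_0$, the
prior predictive distribution for $y$ under $M_i$ using the robust prior is

$$m_i^R(y) = m_0^R(y)\, Q_{i0}^{-(n-k_0)/2}\, \frac{2a}{k_i + 2a}\, [\rho_i(n + b)]^{-k_i/2}\, \mathrm{AP}_i,$$

where

$$m_0^R(y) = \frac{1}{2}\, \pi^{-(n-k_0)/2}\, |X_0^t X_0|^{-1/2}\, \Gamma\Big(\frac{n - k_0}{2}\Big)\, \mathrm{SSE}_0^{-(n-k_0)/2}$$

and $\mathrm{AP}_i$ is defined in Section 3.4.1. Hence the Bayes factor obtained with prior $\pi_i^R$
in (8) can be compactly expressed as in (16).

Proof. It is convenient to carry out the proof in the orthogonal transformation
of the parameters as in Appendix A.5. Using standard normal
computations, the prior predictive distribution under $M_0$ is

$$m_0^R(y) = \int N_n(y \mid X_0\gamma, \sigma^2 I_n)\, \sigma^{-1}\, d(\gamma, \sigma) = \frac{1}{2}\, \pi^{-(n-k_0)/2}\, |X_0^t X_0|^{-1/2}\, \Gamma\Big(\frac{n - k_0}{2}\Big)\, \mathrm{SSE}_0^{-(n-k_0)/2}.$$

Integrating out $\beta_i$, $\gamma$ and $\sigma$, the prior predictive distribution under $M_i$ is

$$\begin{aligned}
m_i^R(y) &= \int N_n(y \mid X_0\gamma + V_i\beta_i, \sigma^2 I_n)\, N_{k_i}(\beta_i \mid 0, B(\lambda))\, a\lambda^{a-1}\sigma^{-1}\, d(\gamma, \beta_i, \sigma, \lambda) \\
&= \frac{1}{2}\, \pi^{-(n-k_0)/2}\, |X_0^t X_0|^{-1/2}\, \Gamma\Big(\frac{n - k_0}{2}\Big) \int_0^1 a\lambda^{a+k_i/2-1}\, (\rho_i(b+n) - (b-1)\lambda)^{(n-k_i-k_0)/2} \\
&\qquad \times (\mathrm{SSE}_i(\rho_i(b + n) - b\lambda) + \lambda\, \mathrm{SSE}_0)^{-(n-k_0)/2}\, d\lambda,
\end{aligned}$$

with $B(\lambda) = (\lambda^{-1}\rho_i(b + n) - b)\,\sigma^2(V_i^t V_i)^{-1}$. This expression can be rewritten
as

$$m_0^R(y)\, Q_{i0}^{-(n-k_0)/2}\, [\rho_i(b+n)]^{-k_i/2}\, a\int_0^1 \lambda^{a+k_i/2-1} \Big(1 - \frac{(b-1)\lambda}{\rho_i(b + n)}\Big)^{(n-k_i-k_0)/2} \Big(1 - \frac{(b - Q_{i0}^{-1})\lambda}{\rho_i(b + n)}\Big)^{-(n-k_0)/2} d\lambda,$$

and the result follows by noting that

$$\mathrm{AP}_i = \frac{2a + k_i}{2} \int_0^1 \lambda^{a+k_i/2-1} \Big(1 - \frac{(b-1)\lambda}{\rho_i(b + n)}\Big)^{(n-k_i-k_0)/2} \Big(1 - \frac{(b - Q_{i0}^{-1})\lambda}{\rho_i(b + n)}\Big)^{-(n-k_0)/2} d\lambda. \qquad □$$

A.7. Proof of Corollary 1. For the prior in (8),

$$\int_0^\infty (1 + g)^{-k_i/2}\, p_i^R(g)\, dg = \int_{\rho_i(b+n)-b}^\infty \frac{a[\rho_i(b + n)]^a}{(g + b)^{a+1}(1 + g)^{k_i/2}}\, dg.$$

The change of variables $z = g - [\rho_i(b + n) - b]$ results in

$$\int_0^\infty \frac{a[\rho_i(b + n)]^a}{[z + \rho_i(b + n)]^{(a+1)}\, [1 + z + \rho_i(b + n) - b]^{k_i/2}}\, dz.$$

It is now easy to see that if $\rho_i(b + n)$ goes to $\infty$ with $n$, this integral vanishes
as $n \to \infty$, satisfying the condition of Result 5.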

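A.8. Proof of Result 7. For simplicity, the explicit dependence of $Q_{i0}$
on $y_m$ will not be shown in this proof, and $\lim_{m\to\infty} Q_{i0}(y_m) = 0$ will be
denoted by $Q_{i0} \to 0$. The robust Bayes factor can be written as

$$\begin{aligned}
B_{i0} &= Q_{i0}^{-(n-k_0)/2}\, [\rho_i(b + n)]^{-k_i/2}\, a\int_0^1 \lambda^{a+(k_i/2)-1} \Big(1 - \frac{(b-1)\lambda}{\rho_i(b + n)}\Big)^{(n-k_i-k_0)/2} \Big(1 - \frac{(b - Q_{i0}^{-1})\lambda}{\rho_i(b + n)}\Big)^{-(n-k_0)/2} d\lambda \\
&= a(\rho_i(n + b))^{-k_i/2} \int_0^1 \lambda^{a+(k_i/2)-1} \Big(1 - \frac{(b-1)\lambda}{\rho_i(b + n)}\Big)^{(n-k_i-k_0)/2} \Big[Q_{i0}\Big(1 - \frac{b\lambda}{\rho_i(b + n)}\Big) + \frac{\lambda}{\rho_i(b + n)}\Big]^{-(n-k_0)/2} d\lambda.
\end{aligned}$$

Note that, since $b > 0$, $\rho_i \ge b/(b + n)$ and $0 < \lambda < 1$,

$$\min\Big\{1, \frac{1}{b}\Big\} \le 1 - \frac{(b-1)\lambda}{\rho_i(b + n)} \le \max\Big\{1, \frac{1}{b}\Big\}$$

and

$$\frac{\lambda}{\rho_i(b + n)} \le Q_{i0}\Big(1 - \frac{b\lambda}{\rho_i(b + n)}\Big) + \frac{\lambda}{\rho_i(b + n)} \le Q_{i0} + \frac{\lambda}{\rho_i(b + n)}.$$

Applying these bounds, it is immediate that

$$c_1 \int_0^1 \lambda^{a+(k_i/2)-1} \Big(Q_{i0} + \frac{\lambda}{\rho_i(b + n)}\Big)^{-(n-k_0)/2} d\lambda \le B_{i0} \le c_2 \int_0^1 \lambda^{a+(k_i/2)-1} \Big(\frac{\lambda}{\rho_i(b + n)}\Big)^{-(n-k_0)/2} d\lambda = c_3 \int_0^1 \lambda^{a+(k_i+k_0-n)/2-1}\, d\lambda \tag{31}$$

for positive constants $c_1$, $c_2$ and $c_3$.

To prove the "only if" part of the proposition, note that the last integral
in (31) is finite if $n < k_i + k_0 + 2a$. Hence $B_{i0}$ is bounded by a constant as
$Q_{i0} \to 0$, and information consistency does not hold.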
To prove the "if" part of the proposition, make the change of variables
$\lambda^* = \lambda/Q_{i0}$ in the lower bound in (31), resulting in the expression



If $n > k_i + k_0 + 2a$, it is clear that this expression goes to infinity as $Q_{i0} \to 0$
(since the integral itself cannot go to 0). If $n = k_i + k_0 + 2a$, the expression
becomes